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ABSTRACT 

Motivation: Genome-wide haplotype reconstruction from sequence 
data, or haplotype assembly, is at the center of major challenges in 
molecular biology and life sciences. For complex eukaryotic organ- 
isms like humans, the genome is vast and the population samples are 
growing so rapidly that algorithms processing high-throughput 
sequencing data must scale favorably in terms of both accuracy and 
computational efficiency. Furthermore, current models and methodol- 
ogies for haplotype assembly (i) do not consider individuals sharing 
haplotypes jointly, which reduces the size and accuracy of assembled 
haplotypes, and (ii) are unable to model genomes having more than 
two sets of homologous chromosomes (polyploidy). Polyploid organ- 
isms are increasingly becoming the target of many research groups 
interested in the genomics of disease, phylogenetics, botany and evo- 
lution but there is an absence of theory and methods for polyploid 
haplotype reconstruction. 

Results: In this work, we present a number of results, extensions and 
generalizations of compass graphs and our HapCompass framework. 
We prove the theoretical complexity of two haplotype assembly opti- 
mizations, thereby motivating the use of heuristics. Furthermore, we 
present graph theory-based algorithms for the problem of haplotype 
assembly using our previously developed HapCompass framework for 
(i) novel implementations of haplotype assembly optimizations (min- 
imum error correction), (ii) assembly of a pair of individuals sharing a 
haplotype tract identical by descent and (iii) assembly of polyploid 
genomes. We evaluate our methods on 1000 Genomes Project, 
Pacific Biosciences and simulated sequence data. 
Availability and Implementation: HapCompass is available for down- 
load at http://www.brown.edu/Research/lstrail_Lab/. 
Contact: Sorin_lstrail@brown.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

The genome sequence of a human individual can be modeled as 
23 pairs of sequences of four nucleotide bases, A, C, G and T, 
representing the 22 pairs of autosomes and the sex chromosomes. 
However, ~99.5% of any two individuals' genome sequences is 
shared within a population. The ~0.5% of the nucleotide bases 
varying within a population range from single-nucleotide poly- 
morphisms (SNPs) to more complex structural changes, for ex- 
ample, deletions or insertions of genomic material. A sequence of 
genomic variants, typically SNPs, with the non-varying DNA 
removed is referred to as a haplotype. 
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Standard genome sequencing workflows produce contiguous 
DNA segments of an unknown chromosomal origin. De novo 
assemblies for genomes with two sets of chromosomes {diploid) 
or more (polyploid) produce consensus sequences in which the 
relative haplotype phase between variants is undetermined. The 
set of sequencing reads can be mapped to the phase-ambiguous 
reference genome and the diploid chromosome origin can be 
determined but, without knowledge of the haplotype sequences, 
reads cannot be mapped to the particular haploid chromosome 
sequence. As a result, reference-based genome assembly algo- 
rithms also produce unphased assemblies. However, sequence 
reads are derived from a single haploid fragment and thus pro- 
vide valuable phase information when they contain two or more 
variants. The haplotype assembly problem aims to compute the 
haplotype sequences for each chromosome given a set of aligned 
sequence reads to the genome and variant information. The 
haplotype phase of variants is inferred from assembling overlap- 
ping sequence reads [Browning and Browning (2011); 
Halldorsson et al. (2003); Schwartz (2010)]. 

The input to the haplotype assembly problem is a matrix 
M whose rows correspond to aligned read fragments and col- 
umns correspond to SNPs (Fig. 1). The quality of Afs construc- 
tion depends on the parameters of the sequencing workflow and 
the accuracy of the read alignment algorithms. Misaligned read 
fragments can introduce erroneous base calls or sampling biases 
so the careful alignment of sequence reads is necessary for high- 
quality haplotype assemblies. Without read alignment or sequen- 
cing errors, the haplotype assembly problem can be solved in 
time linear in the size of M by partitioning the fragments in 
two sets whereby no fragments internal to a set share an SNP 
and differ in the allele called. To address erroneous base calls or 
misplaced alignments, three primary haplotype assembly opti- 
mizations have been developed: minimum error correction 
(MEC), minimum SNP removal (MSR) and minimum fragment 
removal (MFR). The goal is to convert M into a state such that 
the fragments (rows of M) can be distributed into two sets cor- 
responding to the two haplotypes. All fragments in a set must 
agree on the allele at each SNP site and this is accomplished 
using the minimum number of SNP allele flips (0 to 1 or vice 
versa - MEC), SNP (columns of M) removals (MSR) or frag- 
ment (rows of M) removals (MFR). 

Lancia et al. (2001) and Rizzi et al. (2002) provide a theoretical 
foundation for the MFR and MSR optimizations and describe 
the fundamental SNP and fragment conflict graph structures. 
The first widely available haplotype assembly software package 
was presented in Panconesi and Sozio (2004) in which the au- 
thors describe the Fast Hare algorithm, which optimizes the 'Min 
Element Removal' problem. Bansal et al. (2008) describe a 
Markov chain model with Metropolis updating rules to sample 
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Fig. 1. Construction of the input to the haplotype assembly problem 

a set of likely haplotypes under the MEC optimization. In a 
follow-up, the authors present a much faster algorithm on a 
related graph model that relates maximum cuts to SNP allele 
flips (in the MEC model) [Bansal and Bafna (2008)]. Still other 
authors have suggested reductions to the well-known maximum 
satisfiability problem [He et al. (2010); Mousavi et al. (201 1)] The 
Levy et al. (2007) algorithm is a well-known heuristic that was 
used to haplotype assemble the HuRef genome; it assigns frag- 
ments to haplotypes in a greedy fashion and iteratively refines 
the solution by comparing the set of fragments to the assembled 
haplotypes using majority rule phasings. In a recent survey, 
Geraci (2010) describes the Levy et al. (2007) algorithm as, ar- 
guably, the best performing algorithm tested. 

The first extension of the haplotype assembly problem that 
addressed the simultaneous assembly of multiple diploid 
chromosomes was presented in Li et al. (2006); however, the 
benefits of multi-haplotype assembly are not clear for a set of 
unrelated individuals. Halldorsson et al. (2011) continued devel- 
opment of this theory by describing methods for assembling in- 
dividuals who share a haplotype identical by descent (IBD) using 
relationships among the reads. 

Aguiar and Istrail (2012) introduced a new graph data struc- 
ture, algorithmic framework and the minimum weighted edge 
removal (MWER) optimization, which together have several ad- 
vantages over existing methods. Recall that the rows of M cor- 
respond to sequence read fragments with the non-polymorphic 
bases removed such that only SNPs remain. The HapCompass 
model defined in Aguiar and Istrail (2012) is composed of the 
compass graph G c core data structure, which summarizes the 
rows of M using edges weights and the MWER optimization 
that aims to remove a minimum weighted set of edges from G c 
such that a unique phasing may be constructed. The algorithm 
operates on the spanning-tree cycle basis of G c to iteratively 
remove errors that are manifested through a particular type of 
simple cycle [Deo et al. (1982); Mac Lane (1937)]. 

In this work, we prove a number of theoretical results for the 
previously described MWER optimization on compass graphs. 
The main result proves MWER is NP-hard and motivates the 
use of our heuristic algorithms. Further, we demonstrate how 
extensions to the generalized diploid HapCompass model can 
enable (i) usage of different optimizations, for example, MEC 
and MWER, to be used in the local optimization step, (ii) sim- 
ultaneous assembly of two individuals sharing a haplotype tract 
IBD and (iii) haplotype assembly of a single polyploid organism. 
Finally, we evaluate our methods on 1000 Genomes Project, 
Pacific Biosciences and simulated data. 



2 METHODS 

Let a fragment / be a sequence read with the non-polymorphic bases 
removed such that only SNPs remain. Fragments may be either a 
single contiguous region of DNA or contain any number of gaps between 
contiguous regions (for example, one gap between two contiguous regions 
in paired-end sequencing). Each SNP must be heterozygous and each row 
must cover at least two SNPs to be able to extract useful haplotype phase 
information from sequence reads. An SNP allele is encoded as 0 or 1 
corresponding to the major or minor allele. The k base of the i frag- 
ment is referred to as fl^. If fl does not include the base k in the sequence 
read (within the gap of a paired-read, for instance), then f^ ='-'. Let M 
be the m x n SNP-fragment matrix with m rows corresponding to the m 
fragments and n columns corresponding to n SNPs. Two fragments J] and 
fi are in fragment conflict if 

3*l/tJ :#J5.* :A Aa# , - , Aj5- k #«- 1 (1) 

Informally, fragment conflict represents two fragments that include the 
same SNP but differ in the allele. The fragment conflict graph Gf has a 
vertex for each fragment in M and an edge between two fragments if they 
are in fragment conflict. M is feasible if a bipartition exists in G F or, 
equivalently, the fragments of M can be partitioned in two sets such 
that no two fragments within each set are in fragment conflict. 

2.1 The MWER optimization and HapCompass models 

The HapCompass model defined in Aguiar and Istrail (2012) is composed 
of the compass graph G c data structure and optimizations on the span- 
ning-tree cycle basis of this graph. Gc is a graph with a vertex for each 
SNP and an edge between two SNPs if at least one read contains both 
SNPs (Fig. 2). The weight on the edge, defined by the function J'mwer, is 
the difference between the number of fragments that suggest a ™ phasing 
and the number of fragments that suggest °q. 

A path in Gc corresponds to a phasing of the SNP vertices by con- 
catenating the phasings on the edges. For example, the (s\, sj), (s2,sa) 
path in Figure 2 corresponds to the concatenation of the ®' B phasing 
with the °„ phasing, yielding the "J" phasing. A number of subtle com- 
binatorial properties of the diploid read information define the contiguity 
of the assembly; in haplotype assembly of polyploid genomes where more 
than two haplotypes exist, these combinatorial properties will be made 
explicit and generalized as a basis for the development of polyploid 
haplotype assembly algorithms. 

A spanning tree in Gc corresponds to a valid phasing of the SNPs of 
Gc- Simple cycles in Gc have the property of being non-conflicting, 
whereby every path in the cycle including the same set of vertices corres- 
ponds to the same phasing, or conflicting, whereby there is no unique 
phasing. Aguiar and Istrail (2012) show that a simple cycle is conflicting if 
and only if there is a 0-weight edge or an odd number of negative weight 
edges and non-conflicting otherwise. For the MWER optimization, the 
HapCompass algorithm constructs a spanning-tree cycle basis of G c and 
removes edges of small weight (in absolute value) from conflicting cycles 
until Gc is void of conflicts. 

The generalized HapCompass model described in this work supports 
multiple optimizations on compass graphs, joint haplotype assembly of 
individuals sharing a haplotype tract IBD and haplotype assembly of 
polyploid organisms. To support these algorithmic extensions, we exam- 
ine key concepts of the HapCompass model and describe their 
generalizations. 

The core of the HapCompass framework constructs the compass graph 
Gc, a spanning-tree cycle basis of Gc, and then corrects conflicting cycles. 
One such method for correcting conflicting cycles was presented in Aguiar 
and Istrail (2012) where edge weights are used to compute a set of edges 
whose removal would eliminate conflicting cycles (the MWER optimiza- 
tion). In principle, other methods may be used to remove edges, or entirely 
new optimizations may be employed, for example, MEC. Specifically, we 
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Fig. 2. Construction of the compass graph from SNP-fragment matrix 
M. The SNP-fragment matrix M (left) contains four fragments and four 
SNPs. Each SNP's pairwise phasing relationship defined by the fragments 
is represented on the edges of the compass graph (right). The majority 
rule phasing for one of the haplotypes is shown in red on the compass 
graph edges 

implement an algorithm for the MEC optimization on compass graphs. 
However, before an implementation of an MEC algorithm on compass 
graphs can be realized, the HapCompass framework must be generalized 
to allow for corrections to fragments. 

Concept 1: edge weights. The HapCompass framework proposed in 
Aguiar and Istrail (2012) defines edge weights as the difference between 
the number of reads indicating the and phasings. The generalized 
model includes a vector for edge e, v e , consisting of four integers corres- 
ponding to the four possible haplotypes between two SNPs: 00, 01, 10, 11. 
A function, /(e), maps the vector to a meaningful value interpreted by the 
HapCompass algorithm. For example, in the MWER HapCompass al- 
gorithm, fMWER(e) = v e [0] + v c [3] - v c [l] - v,,[2] where v e [t\ is the count 
of the phasings 00, 01, 10, 11 for /' = 0, 1,2, 3, respectively. 

2.2 An MEC HapCompass optimization 

The MEC optimization on Gc aims to flip the minimum number of alleles 
such that all of the cycles are non-conflicting. The MEC algorithm pro- 
ceeds by building a spanning tree cycle basis of the compass graph. The 
following steps are repeated until each edge is non-conflicting, (i) For 
each edge e in the set of conflicting cycles: let v\ and v 2 be the two vertices 
adjacent to e. (ii) If fMtvEn(e)<0, we check the fragments that include 
both vi and v 2 , and temporarily flip the fragment alleles of V\ (v 2 in 
following iteration) to indicate ™ phasings. The other alleles in the frag- 
ments cause edges adjacent to Vj (v 2 ) to change weight as well. We record 
the number of conflicting cycles resolved and created by checking each 
cycle in the cycle basis including an edge that was modified by the flipping 
of a fragment allele, (iii) The case of /mich(c)>0 is handled analogously 
with the exception of flipping the alleles to indicate ^ phasings. (iv) Let 
the number of conflicting cycles resolved by processing e be c Ci r and the 
number of conflicting cycles created be c e<c . If max^ e (c eir — c eiC ) < 0, then 
there does not exist a favorable switching of fragment alleles and an edge 
is removed following the MWER algorithm. Otherwise, the fragment 
changes giving max Vt ,(c e-r — c CjC ) < 0 are introduced in G c - (v) When 
all cycles are non-conflicting, we output the phasing defined by any 
spanning tree. 

The primary data structure change in Gc introduces a mapping of edges 
to fragments. The primary addition to the HapCompass framework is a 
definition of optimization function to remove conflicting cycles from G c - 

2.3 IBD tracts and haplotype assembly 

Thus far, the HapCompass framework has only been defined for a single 
diploid individual. The generalization of haplotype assembly to multiple 



genomes must be selective for which individuals to assemble jointly. For 
example, if two individuals do not share a haplotype by descent, one 
individual's set of reads does not provide any information for the 
other. However, when two individuals do share a haplotype by descent, 
the shared haplotype provides phasing information across homozygous 
sites as long as one individual remains heterozygous (Fig. 3). Regions of 
homozygosity in an individual, which would otherwise disconnect SNPs 
and partition haplotype solutions, can be phased together as long as the 
jointly assembled genotype has heterozygous SNPs within the interval. 

Concept 2: multiple genotypes. The problem of joint assembly of two 
individuals who share a haplotype IBD (hereafter referred to as a pair) is 
different from jointly assembling two individuals who do not share a 
haplotype. In the compass graph, two unrelated genotypes have the 
effect that both individuals can be heterozygous but have completely 
different phasings. However, if they share a haplotype, a transition 
from a doubly heterozygous SNP to another doubly heterozygous SNP 
forces exactly two phasings, namely f\ or ^ (for example, SNP transitions 
(1,2) and (4,5) in Fig. 3). For the doubly heterozygous to singly hetero- 
zygous transitions, we may have exactly three of the four possible 2-SNP 
haplotypes. In Figure 3, the child's genotype is 22 122 and to phase this 
block using the child's data alone, we require a read to cover at least one 
of the first two SNPs and at least one of the last two SNPs, which may be 
impossible depending on the distance between the SNPs and sequence 
read insert length. However, if we assemble the parent with the child, we 
can use the shared haplotype to decode the parent's phase across SNPs 2, 
3 and 4 to be ™. Because they share a haplotype, the 1 1 1 haplotype must 
be the shared haplotype and it can be inferred that the child's phased 
haplotypes are ?o i m - 

Joint haplotype assembly in HapCompass is thus encoded as follows. 
Each edge now has two sets of vectors corresponding to the 2-SNP haplo- 
type transitions of the parent and child. For a doubly heterozygote to 
doubly heterozygote transition, the weight function can be computed as 
before using the coverage from both individuals (because there are exactly 
two disjoint phasings). For a singly heterozygote to doubly heterozygote 
transition (or vice-versa), the weight function can solely use the hetero- 
zygous-heterozygous transmission data from a single individual. 

2.4 Haplotype assembly of polyploid genomes 

The research literature concerning polyploid haplotype assembly is essen- 
tially non-existent. The analysis of /c-ploid genomes (k sets of chromo- 
somes) has been hindered by the complexity of sequencing and 
assembling k chromosomes concurrently. With high-throughput sequen- 
cing technologies, genotype inference in polyploid organisms is manage- 
able; sequence reads are mapped to a reference genome, and the relative 
quantities of alleles at an SNP can be inferred from sequence coverages. 
However, the basic assumption that there exists exactly two phasing be- 
tween two SNPs no longer holds. We note that the polyploidy assembly 
problem is similar to a number of problems in other areas of haplotype 
reconstruction (when the number of haplotypes is known or unknown) 
such as modeling metagenomics (organism identification), HIV (viral 
quasispecies identification in the 'metagenome' of patients), cancer 
(tumor and plasma) and epigenetics (regulatory region methylation re- 
construction similar to 'probabilistic haplotype' inference). 

Concept 3: uniqueness and disjoint phasings. One difficulty of polyploid 
haplotype assembly emerges from the non-disjointness of phasing solu- 
tions between SNPs. With the assumption that SNPs are biallelic, at least 
one haplotype will be shared by two or more phasings between two SNPs. 
In the diploid case, a read suggesting the 00 phasing could be interpreted 
as evidence for 1 1 on the other haplotype (uniqueness of phasing) and 
also evidence contradicting the °' 0 phasing (disjointness of haplotypes in 
phasing solutions). In the tetraploid case, for example, if the genotype for 
each of 2 SNPs is {0,0, 1, 1) then there exists three possible haplotype 
phasings: (00,00,11,11), (01,01,10,10), (00,01,10,11). 
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Fig. 3. A graph of the haplotype transitions defined by the majority rule 
phasings of a compass graph. SNPs 1, 2, 3, 4 and 5 (left to right) are 
shown with both alleles (vertices), and edge transitions are encoded by a 
specific type of line depending on whether the haplotype is shared IBD or 
unique to the child or parent. The genotype of the parent and child are 
22222 and 22 122, respectively (where the two corresponds to the hetero- 
zygote and 0 and 1 correspond to homozygous for the major and minor 
alleles, respectively) 



In general, the number of haplotype phasings on an edge is a function 
of the ploidy of the organism and the alleles at each SNP. As in the 
diploid case, each SNP must have at least one of each allele or else the 
SNP is homozygous and sequence observations of an allele do not pro- 
vide any phasing information. As a result, every 2-SNP haplotype in- 
cludes either "° or [J. 

However, unlike in the diploid case, the extension from one edge in G c 
to the next may not be deterministic. For example, in diploid assembly, if 
a reads suggest a ™ phasing for SNPs 1 and 2, and a ™ phasing for SNPs 
2 and 3, the extension would give us a phasing of™". A conflicting cycle 
in G c could then be generated if reads connecting SNPs 1 and 3 disagreed 
with this phasing. For the polyploid case, if the genotypes for each of 
SNP 1 and 2 are (0,0,1), then both the (00,00,11) phasing and (00,01,10) 
phasing are valid. Assume that we can compute the phasings between 
SNPs 1 and 2 and SNPs 2 and 3 to be (00,00,1 1); we can extend as we did 
in the diploid case to create the phasing (000,000,111). Then, if a read 
suggests a 01 phasing between SNPs 1 and 3, we again generate a con- 
flicting cycle. However, if the SNPs were phased using (00,01,10) for 
SNPs 1,2 and 2,3, then either phasing (000,010,101) or (001,010,100) is 
possible. Both are completely valid phasings consistent with the genotype 
and read data but fragments connecting SNPs 1 and 3 may constrain the 
phasing solution to be unique. 

Concept 4: polyploid edge decidability. The polyploid HapCompass 
model retains the axiom that each edge is decidable; that is, each edge 
has a unique and computable phasing as defined by the reads. The com- 
pass graph and spanning tree cycle basis is built from the input genotypes 
and reads as before. The distribution of haplotype configurations be- 
tween two SNPs are defined by the genotypes, and a singular configur- 
ation is computed using the available read data. The first approach 
attempts to assign reads into haplotype bins that represent the haplotype 
distribution for a valid phasing between two SNPs. Given a 2-SNP geno- 
type, a binning is an assignment of reads to haplotypes. For example, if 
two SNPs both had two 0 alleles and two 1 alleles, there would exist three 
haplotype phasings: (00,00,11,11), (01,10,11,00), (01,10,01,10), each with 
4 bins. The phasing (00,00,1 1,1 1), for instance, would contain two 00 bins 
and two 1 1 bins. 

Greedy binning algorithm. Input: a maximum distance d between any 
two bins, a set of haplotype phasings P and a set of reads R. Output: the 
haplotype phasing most supported by the reads. 

(1) For each haplotype phasing p e P 

(2) For each haplotype bin b e p, do steps (3-5). 

(3) Loop through steps (4-5) until all read fragments have been 
assigned. 

(4) Select a read r e R such that the edit distance between r and an 
available haplotype bin /; 6 b is minimal. 

(5) Place r in the selected bin h and remove this read from the read set. 



(6) Report the haplotype phasing with the binning of minimum total 
edit distance as the optimal phasing. 

We enforce that the difference of haplotypes in each bin must be at 
most d haplotypes to avoid always preferring diverse haplotype phasings 
[e.g. (10,10,01,01) versus (00,11,10,01)]. This condition defines which 
haplotype bin is available during each iteration. 

Probabilistic binning algorithm. Alternatively, probabilities of each 
phasing given the set of reads can be computed and uncertainty can be 
accounted for when extending phase to adjacent edges. In particular, we 
wish to compute the likelihood of a phasing given the set of input sequence 
reads. Let be the f h phasing for two adjacent SNPs, P the set of all 
possible phasings for the two SNPs, ;- ( be the/' 1 read and s e the probability 
of a sequencing error. Then, the likelihood of a particular phasing p p is 



L(p p \s e ,r\,r 2 , ...,r„) ■■ 



P(n,n 



,r n \s e ,p p ) 



EP(rur2,...,r n \ Se ,Pi) 

i=l 

P(n\s e ,p p ) ■ P(r 2 \s e ,p„) ■ ■ ■ P(r n \s e ,p p ) 



,r n \s e ,Pi) 



which may be computed using the assumption that sequence reads are 
independent. The probability of a read r, given sequencing error s e and 
phasing p can be computed by marginalizing over all possible haplotypes 
h sampled for phasing /;: 



hep 



(2) 



Thus, the edge is decisive for the haplotype phasing with the maximum 
likelihood for all reads that span the two SNPs. The original diploid 
scoring scheme can be recreated with a manipulation of the unnormalized 
phasing likelihoods: £Li P( n \s e = 0,h =JJ) - £Li P(r,\s e = 0,h =>°). 

Concept 5: conflicting cycles and phase extensions. Both the greedy and 
probabilistic binning algorithms decide the haplotype phase of edges. In 
the diploid case, the extension of phasings from edges to paths was un- 
ambiguous because for each of the two phasings, exactly one haplotype 
begins with 0 (or 1) and exactly one haplotype ends with 0 (or 1). 
Therefore, the computation of phasings for paths and conflicting cycles 
was easily determined given the decided edges. In polyploid genomes, 
each SNP variant in G c is still assumed to have only two possible alleles 
but each edge has three or more haplotypes. When extending phase from 
one edge to an adjacent edge, the haplotypes on different edges that share 
an allele can be used for extending phase. If this allele is present in k 
haplotypes, then there are k\ possible extensions. 

Phase extension algorithm. We introduce the chain graph G/„ which is 
defined on a path or cycle in G c for a i-ploid genome. Let 
(ci , <?2. ■■■,<"/)= P denote a path of edges in G c of length /. Each edge 
e, is phased (by the greedy or probabilistic method) and each haplotype in 
the phasing introduces a vertex in Gi, at level i. Thus, G/ t contains k 
vertices for each e, 6 p and a total of / ■ k vertices in total. Two haplotype 
vertices are connected by an edge if and only if they share an SNP pos- 
ition and allele. Because haplotypes at adjacent levels uniquely share an 
SNP position in G/„ edges only exist between adjacent levels and a path 
through the chain graph corresponds to a joining (or extension) of haplo- 
types. Therefore, there is always a valid phasing for a G/, defined on a 
path of G c - 

Cycles introduce complexity in G/,. Gi, defined on a cycle retains the 
characteristics of the path chain graph, but also includes source and sink 
nodes: si , . . . , st and t\ , .. .,tk, respectively. Let (e\ , e 2 , ...,«/, ei) = p 
denote any path of edges in Gc of length / with the addition of the 
(e;, e\ ) edge. Source nodes are connected arbitrarily to haplotypes on 
level 1 but haplotypes on level / are only connected to sink nodes if the 
shared variant position agrees with the haplotype the source was con- 
nected to; for example, in Figure 5 Top, t2 is connected to both 00 
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Fig. 4. Compass graphs Gc, g , a non-conflicting polyploid cycle (left), and 
Gc.c, a conflicting polyploid cycle (right). The vector on the edge corres- 
ponds to the haplotype counts for an edge in the format [00,01,10,11]. In 
both compass graphs, the haplotypes are 000, 000 and 111, while the 
reads in Gc, g are 000, 000 and 111, and the reads in Gc,c are 00-, 01-, 
10-, -00, -00, -11, 0-0, 0-0 and 1-1 



haplotypes at level / because s2 is connected to a haplotype starting with 
0. The sources and sinks represent the (e/, e\) edge and a path from s t to t t 
represents one valid haplotype. This intuition enables the formulation of 
the k vertex disjoint paths problem on chain graphs. If there exists k 
vertex disjoint, s t to /, paths for i = 1, ,.,,k, we have k valid phasings 
for the cycle; otherwise, the cycle is conflicting and there is no valid 
phasing. 

To further build intuition, consider a conflicting cycle of G c and G/, in 
the diploid case. A cycle was conflicting if the number of negative 
weighted edges in Gc was odd. Relating this to the chain graph G/,, an 
Si node would be connected to a 0 (or 1) and each negative edge would flip 
the next bit. So, a conflicting cycle has an odd number of negative edges, 
which translates into an odd number of bit flips resulting in no s, to 
path for i = 1,2. Figure 4 gives an example of non-conflicting and con- 
flicting cycles in polyploid compass graphs and Figure 5 their chain 
graphs. 

Gi, enables the (i) determination of conflicting cycles and (ii) compu- 
tation of the phased haplotypes for a path or cycle using disjoint paths. 
The ^'-disjoint paths problem is a well-studied optimization in the field of 
discrete mathematics [Robertson and Seymour (1995)]. A polynomial- 
time solution is known to exist for the node disjoint paths problem 
when k is known as part of the input [Robertson and Seymour (1995); 
Kawarabayashi et al. (2012)], but these algorithms require manipulation 
of enormous constants rendering them difficult to implement in practical 
settings. 

Fortunately, the structure of G/, enables a much more efficient solution 
to the problem. All paths from to f, can be computed by a modified depth 
first search algorithm. A depth first search is started from each source s, 
and the path from source to the current node is stored. Each node contains 
a list of integers initially empty. When the algorithm either encounters the 
sink node t h or a node already labeled with i, all nodes on the current path 
have )' added to their list. After each source-sink pair is processed, each 
node contains a label ;' if there is an s ; to t, path that includes the node. The 
runtime of this algorithm is O(kve) where k is the ploidy, and v and e are the 
number of vertices and edges in G/,, respectively. 

After all nodes are labeled, we iterate through each level of G/, and 
create an auxiliary flow graph G' h where / is the level. G' h defines a bipart- 
ite graph where one set of vertices corresponds to the source haplotype 
paths, which are connected to a set of vertices corresponding to the 
haplotypes of the phasing level /. A flow in G' h of total value k where 
each edge has capacity 1 corresponds to a maximum matching and thus a 
valid assignment of haplotype paths to haplotypes of the phasing at level 
/. This flow can be found in time linear in the size of the edge set of G l h . If 
every level of the chain graph has a valid bijection, then the cycle is non- 
conflicting and the path given by the matchings define a valid phasing. 
Figure 6 give an example of the auxiliary flow graphs for level 1 of the 
chain graphs defined in Figure 5. 
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Fig. 5. The chain graphs (top) G;, -g and (bottom) G; 1-c corresponding to 
Gc, g and Gc, c , respectively 




Fig. 6. The auxiliary flow graphs (top) G) t and (bottom) G], c . For a k- 
ploid organism (in this case £ = 3), a flow of k with 1 capacity on each 
edge corresponds to a valid assignment of haplotype paths to haplotypes 
of the phasing a level 1 

3 RESULTS 

3.1 Theoretical 

We first present results on the complexity of the MWER opti- 
mization, and related minimum weighted vertex (SNP) removal 
(MWVR) problems on the compass graph G c - These results mo- 
tivate the usage of our heuristics for the diploid and polyploid 
algorithms. Let L c Vc be a subset of vertices in G c , and let G' c 
be the resulting graph created from removing L from V c . The 
MWVR optimization aims to compute an L such that the fol- 
lowing conditions are satisfied: 

® H{s,}£L I w ( s i)\ i s minimal where w(si) is the weight of the i 
SNP (cost of removed vertices is minimal); 

(ii) All edges in G' c are decisive (each edge has a majority rule 
phasing); 

(iii) Choosing a phasing for each edge in G' c by majority rule 
gives a unique phasing for G' c . 
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We omit the straightforward proofs that the MWVR and 
MWER problems are in NP. It remains to be shown that known 
NP-hard problems can be reduced to MWVR and MWER. 

We restate the conflict graph generality lemma from Lippert 
et al. (2002). 

Lemma 1 . Let G = (V,E) be an arbitrary graph. Then there 
exists an SNF '-fragment matrix M such that Gf(M) = G. 

Proof: Introduce a fragment /■ for each vertex v, e V. For 
every two adjacent vertices {v,, v,} e E, introduce a new SNP 
column s k in M where/ /, = 0 and/} ( j = 1. 

Let M be the SNP-fragment matrix constructed from Lemma 
1, G F the corresponding fragment conflict graph of M and G c the 
compass graph of M. 

Lemma 2. Every simple cycle of odd length in G F produces 
exactly one conflicting simple cycle in Gc- 

For the proof of Lemma 2 please see Supplementary Appendix 
Proof of Lemma 2. 

Lemma 3. Every conflicting simple cycle in Gc includes exactly 
one odd length simple cycle in G F . 

Proof. We now interpret conflicting cycles in G c as a set of 
vertices of G c , which define a set of edges in G F . 

Because of the previous lemma, every conflicting cycle in G c can 
be resolved by removing an edge of G F , which corresponds to 
removing a vertex in G c - 

Corollary 1 . There exists no conflicting cycles in Gc if and 
only if there are no cycles of odd length in Gp. 

Lemma 4. Given an M produced from Lemma 1, the compass 
graph Gc(M) is the line graph of Gp(M) with weights of G c as 
defined by the phasing relationships of the fragments of M. 

Proof. The SNPs (columns) of M contain exactly two alleles 
from two fragments that conflict. Therefore, in G F , each SNP 
uniquely defines an edge, and in G c , each SNP uniquely defines a 
vertex. All that remains is to show that every two adjacent edges 
in G F produce an edge in G c - Consider an SNP .s* whose conflicts 
involve fragments / and f. The edge defined by s in G F is adja- 
cent to edges defined by the other conflicts of f and/. The vertex 
s in G c is defined exactly as the pairwise phasing relationships as 
defined by the SNP s and other SNP alleles in fragments/ and/, 
which in turn define the adjacencies in G F . 

Because G c is the line graph of G F , if k simple cycles in G c 
share an edge then k simple cycles in G F share a vertex. 

Theorem 1. MWVR is NP-hard. 

Proof: See the MWVR Proof section in Supplementary 
Appendix. 

Theorem 2. MWER is NP-hard. 

Proof: The reduction is from the problem of removing the 
minimum number of edges of a graph to make it bipartite. Let G 
be an arbitrary graph and M the SNP-fragment matrix as 
defined in Lemma 1. We modify Gf(M) by adding two add- 
itional degree 2 vertices to each edge, effectively converting 



each edge to a length 3 path. Cycles of odd (even) length 
retain their odd (even) length, thus odd length cycles still create 
conflicting cycles in G c . All vertices of degree k produce cliques 
of size k in G c , which do not correspond to any cycles in G F (M). 
Therefore, we label all edges of clique vertices produced from a 
single vertex with weight oo. All paths of G F will be encoded with 
two edges of G c \ both of which cannot be removed in an optimal 
solution to MWER. Given a solution to the MWER optimiza- 
tion, we can determine the minimum number of edges in G F to 
make it bipartite. 

3.2 Experimental 

We evaluate the HapCompass MEC, HapCompass IBD and 
polyploid HapCompass algorithms using 1000 Genomes 
Project [The 1000 Genomes Project Consortium (2010)], Pacific 
Biosciences and simulated data. 

Metrics. To evaluate the accuracy of our diploid haplotype 
assembly methods, we use the following measures, which capture 
different aspects of haplotype assembly quality. In Aguiar and 
Istrail (2012), we introduce the fragment mapping phase relation- 
ship (FMPR) distance, which counts the number of pairwise 
phase relationships (as defined by the input read fragments) 
that do not exist in the solution. The related boolean fragment 
mapping (BFM) distance counts the number of read fragments 
that do not map back to the solution. The third evaluation criteria 
we use is the MEC measure, which counts the number of allele 
flips in the fragments required to produce the phased haplotype 
assembly solution. In all previously described measures, lower 
values are desired. These metrics are similar to read mapping 
metrics in genome and transcriptome assembly, where good-qual- 
ity assemblies will allow for many reads to map back to them. 

3.3 Pacific biosciences data 

Single molecule sequencing has great potential to become a pre- 
ferred method for haplotype assembly, but current algorithmic 
techniques are untested on data with high error rates. We down- 
loaded the chromosome 20 data from individuals HG00321, 
HG00577, HG01101, NA18861, NA19313, NA19740, 
NA20296 and NA20800 [PacBio Data (2013)]. Haplotype assem- 
bly solutions were produced by HapCompass, Levy et al. (2007) 
and HapCUT to obtain the results in Table 1 (run times can be 
found in the Pacific Biosciences run times section in 
Supplementary Appendix). HapCompass outperforms the com- 
petition in terms of MEC using both optimizations. Interestingly, 
the Levy et al. (2007) algorithm is the most accurate in terms of 
FMPR and BFM. This is likely due to the Levy et al. (2007) 
algorithm processing entire read fragments each iteration while 
HapCompass focuses on correcting multiple fragments at adja- 
cent SNPs. Because the Pacific Biosciences read lengths are long 
(several kb), more emphasis is placed on matching reads with 
large overlaps on the same haplotype. This result further suggests 
that it is important to consider the input data and the desired 
results when preparing data for a haplotype assembly experiment. 

3.4 1000 genomes project data 

To further evaluate the HapCompass MEC implementation, we 
haplotype assembled the genome of 1000 Genomes Project 
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NA 12878 CEU child using our implementation of the Levy et al. 
(2007) method, HapCUT (v0.5) and the HapCompass MWER 
and MEC algorithms. Table 2 shows that the HapCompass 
MWER algorithm clearly performs best overall. The full table 
for all chromosomes is given in Supplementary Appendix (sec- 
tion 1000 Genomes Project Results). Surprisingly, even though 
the MWER algorithm does not directly optimize the MEC meas- 
ure, it produces the best haplotypes in respect to this measure for 
all but two chromosomes. 

3.5 IBD haplotype assembly 

Jointly assembling the haplotypes of related individuals has con- 
siderable benefits. The first benefit conies from the extra cover- 
age on the shared haplotype, which helps with differentiating 
true phasings from sequencing errors. However, the most notable 
advantage is being able to extend phasing past homozygous 
blocks. We compared the size of the phased haplotype blocks 
when assembling chromosome 22 of the NA12878 child in the 
1000 Genomes Project data alone versus jointly with the mother. 
Figure 7 compares the maximum achievable haplotype block 
sizes of any single individual haplotype assembly algorithm to 
IBD haplotype assembly; it demonstrates that larger haplotype 
blocks are achievable by assembling two individuals with a 
shared haplotype together rather than separately. 

3.6 Polyploid algorithm 

Finally, to evaluate the polyploid algorithm, we simulated three 
haplotypes at random and simulated reads from these haplo- 
types. The simulated reads were guaranteed to contain two 
SNPs (assuring they are useful for haplotype assembly) and 
given normally distributed insert sizes. The polyploid algorithm 
was run using both the greedy and probabilistic binning algo- 
rithms for deciding edge phasings. Figure 8 demonstrates two 



Table 1. The total FMPR, BFM and MEC scores aggregated across 
individuals HG00321, HG00577, HG01101, NA18861, NA19313, 
NA19740, NA20296 and NA20800 in the Pacific Biosciences data 





HapCompass MWER 


HapCompass MEC 


Levy 


HapCUT 


FMPR 


163 799 


169 385 


153433 


169 890 


BFM 


39 827 


40470 


38318 


41006 


MEC 


48 631 


49 591 


66299 


50164 



interesting results: (i) for a small number of reads, the quality 
of haplotype phasing is independent of the choice of binning 
method and (ii) that the probabilistic algorithm produces a 
more accurate phased solution than the greedy binning method 
for a large range of simulated read counts. 



4 DISCUSSION 

Diploid haplotype inference is still a difficult task, in part due to 
the exponentially many solutions given the input genotype or 
sequence reads. HapCompass is a proven framework for haplo- 
type assembly but there are a number of extensions that may 
improve results. For instance, we did not mention the usage of 
base call or read mapping quality scores in our computations. 
HapCompass can filter based on user-defined thresholds but a 
more elegant solution would be to convert the base call quality 
score for a particular allele call into a probability the base was 
called correctly. This probability can then define the contribution 
of weight to the edges of G c rather than the current weight 
contribution of 1 for each SNP allele called. Also we demon- 
strated in the Pacific Biosciences experiments that the choice of 
assembly method should be informed by the sequencing technol- 
ogy and desired result. The Levy et al. (2007) method mapped 



Haplotype size distribution for trio assembly 
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Fig. 7. Comparison between haplotype assembling the child individually 
versus with a parent. The haplotype size is number of SNPs in the com- 
ponent of Gc, which represents the maximum number of SNPs that may 
be phased together 



Table 2. Haplotype assembly results for the 454 data from 1000 Genomes Project NA12878 individual for chromosomes 1-22 and algorithms 
HapCompass MWER, HapCompass MEC, Levy et al. (2007) and HapCUT 



HapCompass MWER 




HapCompass MEC 




Levy 






HapCUT 






FMPR BFM 

64128 36 578 


MEC 
37597 


FMPR BFM 

68 623 39221 


MEC 
40269 


FMPR 

64789 


BFM 

39 832 


MEC 
40946 


FMPR 
65724 


BFM 

37 606 


MEC 

38 372 



Note: The full table is given in the section 1000 Genomes Project Results in Supplementary Appendix. 
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Polyploid haplotype assembly accuracy 




1B+02 1e+03 1e+04 1e+05 3e+05 

number of reads 

Fig. 8. Comparison of the percentage of correctly phased polyploid SNP 
pairs for the greedy and probabilistic binning algorithms for varying 
number of input reads 



more fragments error free than HapCompass but contained 
many more single base changes in fragments required to repro- 
duce the inferred haplotypes. Considering the Pacific Biosciences 
data has high error rates and generating an error-free read is 
unlikely, a solution with the minimum number of corrected 
errors is likely preferred over a solution that successfully maps 
more fragments without errors. 

The size of the haplotype blocks produced and, ultimately, the 
quality of the assembled haplotypes is a function of several fac- 
tors. The primary difficulty for obtaining large haplotype blocks 
is the small nature and lack of diversity of insert lengths. We 
demonstrated a novel modeling and computational method that 
begins to address this difficulty by exploiting shared IBD haplo- 
type structure. In general, assembling the haplotypes of related 
individuals has considerable benefits, which help overcome un- 
desirable properties of the sequencing data. The first benefit 
comes from the extra coverage on the shared haplotype, 
which helps in differentiating actual phasings from sequencing 
errors. However, the most notable advantage is being able to 
include more SNPs into the haplotype assembly, which helps 
extend the assembly (past regions of low read coverage for ex- 
ample). But, the major advances in block sizes will likely be the 
result of novel experimental procedures and technologies; for 
instance, not only do the single molecule sequencers promise 
larger read lengths, they also enable the inclusion of multiple 
and large insert lengths. 

Organisms having more than two sets of homologous chromo- 
somes are becoming the target of many research groups inter- 
ested in studying the genomics of disease, phylogenetics and 
evolution [Chen and Ni (2006); Leitch and Leitch (2008)]. 
Polyploidy occurs in human disease usually due to the duplica- 
tion of a particular chromosome, for example, in Edwards, Patau 
and Down syndrome. While far fewer mammalian organisms are 
polyploid, specific mammalian cells may undergo polyploidiza- 
tion, for example, in human liver hepatocytes [Gentric et al. 
(2012)]. In addition, polyploid organisms are ubiquitous in the 



Plant and Fungi clades, present in crops that we ingest, convert 
into bioenergy and feed to livestock. Understanding the gen- 
omics of both the desirable — e.g. increased crop yield — and un- 
desirable — e.g. susceptibility to disease — properties of plants 
may lead to critical advances in many research areas but requires 
untangling the polyploid genome and its variation. As more 
polyploid data becomes available, our approach may be used 
to infer haplotypes and begin to understand what effects haplo- 
type variation may influence. 
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